PROCESS FOR MAINTAINING ONGOING
REGISTRATION FOR PAGES ON A GIVEN
SEARCH ENGINE

FIELD OF THE INVENTION

The present invention relates to the process of developing
and maintaining the content of Internet search engine
databases.

BACKGROUND OF THE INVENTION

An internet (including, but not limited to, the Internet,
intranets, extranets and similar networks), is a network of
computers, with each computer being identified by a unique
address. The addresses are logically subdivided into
domains or domain names (e.g. ibm.com, pbs.org, and
oranda.net) which allow a user to reference the various
addresses. A web, (including, but not limited to, the World
Wide Web (WWW)) is a group of these computers accessible
to each other via common communication protocols, or
languages, including but not limited to Hypertext Transfer
Protocol (HTTP). Resources on the computers in each
domain are identified with unique addresses called Uniform
Resource Locator (URL) addresses (e.g.
http://www.ibm.com/products/laptops.htm). A web site is any
destination on a web. It can be an entire individual domain,
multiple domains, or even a single URL.

Resources with an ".htm" or ".html" URL suffix are text
files, or pages, formatted in a specific manner called
Hypertext Markup Language (HTML). HTML is a collection of
tags used to mark blocks of text and assign meaning to them.
A specialized computer application called a browser can
decode the HTML files and display the information contained
within. A hyperlink is a navigable reference in any resource
to another resource on the internet.

An internet Search Engine is a web application consisting of

1. Programs which visit and index the web pages on the
internet.

2. A database of pages that have been indexed.

3. Mechanisms for a user to search the database of pages.

Agents are programs that can travel over the internet and
access remote resources. The internet search engine uses
agent programs called Spiders, Robots, or Worms, among
other names, to inspect the text of resources on web sites.
Navigable references to other web resources contained in a
resource are called hyperlinks. The agents can follow these
hyperlinks to other resources. The process of following
hyperlinks to other resources, which are then indexed, and
following the hyperlinks contained within the new resource,
is called spidering.

The main purpose of an internet search engine is to
provide users the ability to query the database of internet
content to find content that is relevant to them. A user can
visit the search engine web site with a browser and enter a
query into a form (or page), including but not limited to an
HTML form, provided for the task. The query may be in
several different forms, but most common are words,
phrases, or questions. The query data is sent to the search
engine through a standard interface, including but not
limited to the Common Gateway Interface (CGI). The CGI is a
means of passing data between a client, a computer
requesting data or processing, and a program or script on a
server, a computer providing data or processing. The
combination of form and script is hereinafter referred to as
a script application. The search engine will inspect its
database for the URLs of resources most likely to relate to
the submitted query. The list of URL results is returned to
the user, with the format of the returned list varying from
engine to engine. Usually it will consist of ten or more
hyperlinks per search engine page, where each hyperlink is
described and ranked for relevance by the search engine by
means of various information such as the title, summary,
language, and age of the resource. The hyperlinks are
typically sorted by relevance, with the highest rated
resources near the top of the list.

The World Wide Web consists of thousands of domains
and millions of pages of information. The indexing and
cataloging of content on an Internet search engine takes
large amounts of processing power and time to perform.
With millions of resources on the web, and some of the
content on those resources changing rapidly (by the day, or
even minute), a single search engine cannot possibly
maintain a perfect database of all Internet content. Spiders
and other agents are continually indexing and re-indexing
WWW content, but a single World Wide Web site may be
visited once and then not be visited again for months as
the queue of sites the search engine must index grows. A
site owner can speed up the process by manually requesting
that resources on a site be re-indexed, but this process can
get unwieldy for large web sites and is in fact, a guarantee
[...]

Many internet search engines support two methods of
controlling the resource files that are added to their
database. These are the robots.txt file, which is a site-wide,
search engine specific control mechanism, and the ROBOTS
META HTML tag, which is resource file specific, but not
search engine specific. Most internet search engines respect
both methods, and will not index a file if robots.txt, the
ROBOTS META tag, or both inform the internet search
engine not to index a resource. The use of robots.txt, the
ROBOTS META tag and other methods of index control is
advocated for the purposes of the present invention.

Commonly, when an internet search engine agent visits a
web site for indexing, it first checks the existence of
robots.txt at the top level of the site. If the search agent
finds robots.txt, it analyses the contents of the file for
records such as:

User-agent: *
Disallow: /cgi-bin/SRC
Disallow: /stats

The above example would instruct all agents not to index
any file in directories named /cgi-bin/SRC or /stats. Each
search engine agent has its own agent name. For example,
AltaVista (currently the largest Internet search engine) has
an agent called Scooter. To allow only AltaVista access to
directory /avstuff, the following robots.txt file would be
used:

User-agent: Scooter
Disallow:
User-agent: *
Disallow: /avstuff

The ROBOTS META tag is found in the file itself. When
the internet search engine agent indexes the file, it will look
for an HTML tag like one of the following:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">

INDEX and NOINDEX indicate to all agents whether or
not the file should be indexed by that agent. FOLLOW and
NOFOLLOW indicate to all agents whether or not they
should spider hyperlinks in this document.

For current internet search engines, the present invention
process uses the CGI program(s) provided by the search
engine in order to add, modify and remove files from the
search engine index. However, the process can generally
only remove a file from the search engine index if the file no
longer exists or if the site owner (under the direction of the
process) has configured the site, through the use of
robots.txt, the ROBOTS META tag or other methods of
index control, so that the search engine will remove the file
from its index.

The duration of time between the first time a site is
indexed and the next time that information is updated has led
to several key problems:

A. A resource that is modified or removed by its owner
after it is indexed by a search engine could be
incorrectly listed in that search engine for months until an
agent visits the site to register the change.

B. A resource may be modified since the last time it was
indexed, in which case a user may never be directed to
the new content, or incorrectly directed to content that
is no longer present.

C. Deleted resources can create the impression for a
search engine user that a whole web site has shut down,
that the information the user is looking for is removed,
or that the web site is not being maintained, when the
resources may have simply been moved to another
location on the site as part of regular site maintenance.

D. Automated tools such as search engines apply their
own criteria in order to determine the relevancy of a
particular resource for a particular query. These
automated criteria can lead to the search engine returning
spurious, misleading, or irrelevant results to a particular
query. For example, a recent search for the nursery
rhyme "Rub a dub dub, three men in a tub" on a
particular search engine resulted in the top ten search
results containing discussions of various issues among
consenting males.

E. Automated agents are not always able to understand the
context of the pages they index, as illustrated by the
example above. As such, their one-dimensional
capabilities allow web masters to create the impression that
the resources on a particular site contain information
they do not. This is done to direct traffic to sites by
providing incorrect or misleading information, a
process called spamming.

F. Most automated agents are incapable of processing the
content of resources that are binary in nature, such as
applications written in the programming language Java.
These applications can display text data, but do not use
text or HTML files to do so. Instead, the information is
encoded in binary form in the application. As such, an
agent cannot determine the content of a resource coded
in this manner.

The present invention provides a mechanism for search
engine and web site managers to maintain as perfect a
registration of web site content as is possible. By
augmenting or replacing existing agents and manual registration
methods with specialized tools on the local web site (and,
when feasible, at the search engine), the current problems
with search engine registration and integrity can be
eliminated.

SUMMARY OF THE INVENTION

The present invention defeats the key problems with
automated agents and manual registration and replaces them
with an exception based, distributed processing system.
Instead of making the search engine do all the work
necessary to index a site, the web site owner is now responsible
for that operation. By distributing the work, the search
engine is improved in these ways:

1. The search engine can maintain perfect ongoing
registration and indexing of pages by re-indexing at a set
interval, as frequently as the web site owner chooses.

2. The search engine can maintain an intelligent database,
not limited by the conditions that automated agents
have imposed on them and not easily corruptible by
web site owners with less ethical practices.

3. The search engine provides a guarantee of integrity to
[...], ultimately providing a more valuable service
to both users and web site owners.

The process is begun by distributing a set of search engine
update software tools to the web site owner. These tools can
be implemented in one of three ways. The first way is to
implement the tools on the web server of the site owner. The
software can run automatically, having direct access to all
resources on the site. The second way is to run the tools on
a surrogate server, a computer with proper permissions and
access to the resources of the web site, which automatically
accesses those resources over the network. The third way is
through the use of client-side tools. The software will run
on each client's computer, check the client's web server via
internet protocols, and relay the information on the web
server to the search engine.

The software could be written in a variety of different
programming languages and would be applicable for as
many client and server computers as needed.

Upon initial execution, the software builds a database of
the resources on the web site. The resources catalogued can
be specified by the user, or automatically through spidering
functions of the software. The database consists of one
record per resource indexed on the site. Each record contains
fields including:

A. The search engines the owner of the web site would
like the resource to be indexed by.

B. The date and time of the last index by each search
engine.

C. The date and time the resource was last modified
according to the local indexing engine.

D. [...] updating, inclusion, or removal from a particular
search engine database.

Upon each subsequent execution the software tools [...]
database. When altered, removed, or additional content is
[...] appropriate [...] database and then notify the search
engine of [...] (see FIG. 1, Box 206a, 207b-c). Changes to
the database are made as follows:

A. A resource is marked as deleted if the resource is listed
[...]

B. A resource is marked as modified if the date and time
of last modification in the current database is earlier
than the date and time of last modification provided by
the web server for the resource.

C. A resource is added and marked as added if it is present
on the web server, but not yet in the database, and the
web site manager has opted to add it either manually or
automatically.

Through application of the present invention, the
following improvements are made in search engine administration:
1. The task of spidering the web site has been distributed
to the web site owner (see FIG. 1, Box 205c).

2. The web site owner has the capability to protect brand 
image from being injured by a search engine pointing 
potential visitors to deleted, irrelevant, or incorrect
resource information. 

3. The search engine owner has a higher degree of 
database integrity. Less information storage space is
wasted on spurious, nonexistent or incorrect data. 

4. The web site owner can directly indicate the keywords 
and other descriptions that are most appropriate for 
each resource in the site, as opposed to using the 
cumbersome HTML 'Meta' tag to specify the keywords
for the agent. Keywords are words that are particularly 
relevant to a particular resource and might be used on 
a search engine to locate that resource. 

5. The search engine can create a reverse index of 
keywords that the individual site owners have identified 
for each resource. For example, a user could query for 
a list of all web sites that have listed 'dog' as an 
appropriate keyword. 

6. The internet search engine could be used by users to 
query the content of a particular web site, as opposed
to requiring a web site based search engine to index the
content. This saves administration effort and computing 
resources at the web site. 

The main aspect of the present invention is to provide a
method to index locally at a web site all changes to that site's
resource content database which have occurred since the last
search engine indexing.

Another aspect of the present invention is to actively 
transmit said changes to an internet search engine. 

Another aspect of the present invention is to automatically
transmit batches of updates (a list of content that has
changed since the last search engine index), in a
predetermined manner.
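
As an illustration of what such a batch might contain, the following minimal Python sketch compares the locally stored database with the state reported by the web server and lists the deleted, modified and added resources. The function name `diff_site` and the use of dictionaries mapping a URL to its last modification time are assumptions made for this example only; they are not part of the disclosed process.

```python
from datetime import datetime
from typing import Dict, List


def diff_site(database: Dict[str, datetime],
              server: Dict[str, datetime]) -> Dict[str, List[str]]:
    """Return the batch of changes since the database was last updated.

    Both arguments map a resource URL to its last modification time
    (hypothetical representation chosen for this sketch).
    """
    # Listed in the database but no longer present on the server.
    deleted = sorted(url for url in database if url not in server)
    # Present on the server but not yet in the database.
    added = sorted(url for url in server if url not in database)
    # Stored modification time is earlier than the one the server reports.
    modified = sorted(url for url in database
                      if url in server and database[url] < server[url])
    return {"deleted": deleted, "modified": modified, "added": added}
```

The resulting dictionary is one possible shape for a batch of updates; how it is transmitted to the search engine is left open here.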

Other objects of this invention will appear from the
following description and appended claims, reference being
had to the accompanying drawings forming a part of this
specification wherein like reference characters designate
corresponding parts in the several views.

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 is a flowchart of the steps to select which search
engines will receive updates and which files shall be updated
on those search engines.

FIG. 2 is a diagram of the decision tree for determining
the state of a specific resource on a particular search engine
database, and the action needed to update the internet search
engine as enabled in FIG. 1.

FIG. 3 is a diagram of the Internet search engine update 
process of updating the files as in FIG. 1 and resources 
defined by FIG. 2. 

Before explaining the disclosed embodiment of the
present invention in detail, it is to be understood that the
invention is not limited in its application to the details of the
particular arrangement shown, since the invention is capable
of other embodiments. Also, the terminology used herein is
for the purpose of description and not of limitation.

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

The present invention can be used on new Internet search
engine systems, or adapted for use with existing search
engines having the following characteristics:



1. The search engine provides a Common Gateway Inter- 
face to allow resources to be added to, modified, or 
deleted from the search engine database. 

2. The search engine can update the database index
quickly (ideally immediately) in response to addition,
modification, or deletion information provided
through the CGI.

3. The search engine can keep the date and time it last 
indexed a page (or alternatively, the last modification 
date and time of the page when it was last indexed) and 
can make this information available to the web site 
owner. 

In addition, if a search engine allows search results to be
constrained to one particular site, that completes the
functionality requirements of the present invention.

The technical effort required to apply the present inven- 
tion to existing Internet search engines is similar to that 
required to apply the invention to a new search engine. The 
most complex instance would be to apply the invention to a 
range of search engines, some of which have been designed 
with the invention in mind, some of which have not. The 
aforementioned instance will be assumed here. 

As implemented, the invention is a server-side process,
running either on a surrogate server or the actual server upon
which the web site is stored. The process is coded as a
program in the Perl programming language, although other
languages such as C++ or Java could be used. The process
is invoked regularly by the operating system of the computer
on which the program resides or manually by a web site
manager.

As such, there are three main areas of the preferred
embodiment that need to be understood. They are:

I. The implementation and construction of the server side 
tools, which consist of the database and tools to update 
the database. 

II. The process by which the database is constructed and 
updated. 

III. The process by which a search engine is updated by 
a site using this process. 

I. The Implementation and Construction of the Server Side 
Tools, Which Consist of the Database and Tools to Update 
the Database 

Installation of the software tools places a number of CGI 
scripts, database tables, and HTML forms on the server. 
Each element performs a specific function relevant to the 
process and is outlined below. Initially, there is a database 
Table of Search Engines, containing an entry for each 
Internet search engine. The table below illustrates the format 
of a typical search engine record. 



Field                 Type     Default  Description

Name                  String   None     The name of the search engine
Enabled               Boolean  True     Whether the search engine is to be
                                        informed of changes to content
Table of Files        Table    None     Database table of files indexed on this
                                        site and for which changes must be
                                        tracked
Register by default   Boolean  True     Whether to register a resource on this
                                        search engine in the absence of explicit
                                        information provided by the site
                                        manager
Max registrations     Integer  None     The maximum number of registrations
                                        allowed per day by this search engine
Limit to site         Boolean  None     Whether the search engine allows
                                        searches to be restricted to one web
                                        site only
Lists index date      Boolean  None     Whether the search engine will report
                                        the date a resource was last indexed
Lists index time      Boolean  None     Whether the search engine will report
                                        the time a resource was last indexed
Index time            Integer  None     Typical delay between registration time
                                        and indexing of a site by the search
                                        engine
Supports file lookup  Boolean  None     Whether the search engine will allow a
                                        particular file to be searched for
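
The search engine record described above could be modelled along the following lines. This is a minimal Python sketch, not the Perl implementation the description refers to: the class name, the attribute names, the choice of `None` to stand for a "None" default, and the helper `engines_to_update` are assumptions made for illustration. The last field name is garbled in the source and is read here as "Supports file lookup".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SearchEngineRecord:
    """One entry of the Table of Search Engines (illustrative names)."""
    name: str
    enabled: bool = True                      # inform this engine of changes
    table_of_files: Dict[str, dict] = field(default_factory=dict)
    register_by_default: bool = True
    max_registrations: Optional[int] = None   # allowed registrations per day
    limit_to_site: Optional[bool] = None
    lists_index_date: Optional[bool] = None
    lists_index_time: Optional[bool] = None
    index_time: Optional[int] = None          # typical indexing delay
    supports_file_lookup: Optional[bool] = None


def engines_to_update(engines: List[SearchEngineRecord]) -> List[SearchEngineRecord]:
    """Keep only the engines that are enabled for updates."""
    return [engine for engine in engines if engine.enabled]
```

A table of search engines is then simply a list of such records, for example `[SearchEngineRecord(name="AltaVista")]`.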
The user is provided with an HTML form and CGI script,
hereinafter referred to as a CGI program, in order to
configure the Enabled and Table of Files fields (see FIG. 1, Box
100-101). The information the user inputs is submitted over
the Common Gateway Interface (FIG. 1, Box 102) and the
referenced CGI script updates the database tables as
instructed (FIG. 1, Box 103-105). The user can thus enable
(i.e., select) and disable a particular search engine using this
interface. A search engine that is disabled in the database is
simply skipped during an update.

The Table of Files is a field in the Table of Search Engines 
database. It is initially configured by the user through a CGI 
program (FIG. 1, Box 200) to list the files the user wishes to 
be registered with this search engine. This table contains a
record for each resource. Each record contains the following
fields:
Field                 Type          Default     Description

Name                  String        None        The URL of the resource
To Be Registered      Boolean       False       Whether the resource needs to be
                                                registered with this search engine
To Be Unregistered    Boolean       False       Whether the resource needs to be
                                                unregistered (removed) from this
                                                search engine
Date and time last    Date and                  Date and time the file was last
registered            Time                      registered with the search engine
Register              Enum (True,   By default  Whether the site manager wants the
                      False, By                 file to be registered on this search
                      default)                  engine. The 'By default' value
                                                indicates to follow the value of the
                                                'Register by default' field of the
                                                search engine record of the database
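
The three-valued Register field can be resolved against the search engine's 'Register by default' field as in the following minimal Python sketch. The class and function names are illustrative assumptions, and the use of `None` for a file that has never been registered is likewise a choice made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Register(Enum):
    TRUE = "True"
    FALSE = "False"
    BY_DEFAULT = "By default"


@dataclass
class FileRecord:
    """One entry of a Table of Files (illustrative names)."""
    name: str                                   # URL of the resource
    to_be_registered: bool = False
    to_be_unregistered: bool = False
    last_registered: Optional[datetime] = None  # None: never registered
    register: Register = Register.BY_DEFAULT


def should_be_registered(record: FileRecord, register_by_default: bool) -> bool:
    """Apply the site manager's wish, falling back to the engine default."""
    if record.register is Register.BY_DEFAULT:
        return register_by_default
    return record.register is Register.TRUE
```

An explicit True or False on the file record always wins; only 'By default' defers to the search engine record.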



The Table of Files is a list of the above records. The list
is built by first obtaining the set of resources the user wishes
to maintain and register with a search engine (FIG. 1, Box
201). The user enters the files they wish to monitor into a
CGI program and submits the form (FIG. 1, Box 203a-c,
Box 204a-c). The form allows the user to choose from many
methods of building the Table of Files. These methods
include, but are not limited to:

A. The user may list all the resources to be registered
manually. These listed resources are added to the Table
of Files (FIG. 1, Box 202a, 205a).

B. The user may specify a map page. If the user specifies
a map page, this map page is retrieved. All of the
hyperlinked resources on the map page referring to this
web site are added to the Table of Files (FIG. 1, Box
202b, 205b, 206b).

C. The user may specify entry points to the web site. If the
user specifies entry points, the CGI program will enter
the site and spider to all resources referenced on those
entry points, adding those resources to the Table of
Files (FIG. 1, Box 202c, 205c, 207c).
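
Method B, and each step of the spidering in method C, comes down to collecting the hyperlinks on a page that refer to the same web site. The following minimal Python sketch shows one way to do this with the standard library; the names `LinkCollector` and `same_site_links` are assumptions for this example, the page is passed in as a string rather than retrieved over HTTP, and "same web site" is approximated as "same host name", which is narrower than the definition of a web site given earlier.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the HREF value of every anchor tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> List[str]:
    """Return absolute URLs linked from the page that stay on its host."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    found: List[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlparse(absolute).netloc == site and absolute not in found:
            found.append(absolute)
    return found
```

Applying `same_site_links` repeatedly to each newly found page, while remembering pages already visited, would give a simple form of the spidering described in method C.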

The list of pages built by the above process forms the
Name fields of the Table of Files records for each search
engine. This process can be performed globally (on all
search engines in the table of search engines), on a group of
search engines, or on an individual search engine, as
indicated by the user (FIG. 1, Box 206a, 207b, 207c).

Submitting the above form also invokes a CGI script to set
the Enabled and 'Register by default' fields of the
appropriate search engine record according to the preferences of
the user. Additionally, a page is provided where the title,
URL and Meta Description of each page would be
substituted in the appropriate place in the table for each search
engine.

Submitting this additional information invokes a CGI 
script to set the Register field of the Table of Files field for 
the appropriate search engine record, according to prefer- 
ences of the user. 

II. The Process by Which the Database is Constructed and
Updated

The process now looks up each file and determines
whether the file is registered, current, out of date, or deleted
with respect to its registration on the search engine.

There are eight possible states for the file to be in with
respect to its registration. In order for the process to be 
deterministic, all random spidering activity by the search 
engine is ignored in determining the state of the file. The 
state is determined purely by the current registration and the 
data the process has stored in the database of activities 
performed by previous invocations of itself. 

FIG. 2 illustrates the decision process to determine the
state of a resource on the search engine (Box 1) and the
action which must be taken. A resource can be in the
following states:
Deleted (2a)           The resource no longer exists on the web site. If the
                       resource exists in the search engine database, an
                       error is signaled.
Awaiting               The resource is not in state 2a. The resource should
indexing (2b)          shortly be indexed by the search engine and should not
                       be registered now.
Out of date (2c)       The resource is not in state 2a, 2b. The resource is not
                       due to be indexed by the search engine, but has been
                       modified since it was last indexed by the search engine.
Well registered (2d)   The resource is not in state 2a, 2b, 2c. The resource has
                       not been modified since last indexed and its listing
                       on the search engine is correct.
Wrongly                The resource is not in state 2a, 2b, 2c, 2d. The resource
registered (2e)        is listed on the search engine, but the web site manager
                       does not want it to be.
Wrongly                The resource is not in state 2a, 2b, 2c, 2d, 2e. The web
unregistered (2f)      site manager wishes the resource to be registered by the
                       search engine, but the resource is not registered by the
                       search engine or due to be indexed by the search engine.
Correctly              The resource is not in state 2a, 2b, 2c, 2d, 2e, 2f. The
unregistered (2g)      resource is not registered, not due to be indexed, and
                       the user does not wish it to be.
Will be indexed        The resource is not in state 2a, 2b, 2c, 2d, 2e, 2f, or 2g.
in error (2h)          The resource is not listed by the search engine and the
                       site manager does not wish it to be. However, the
                       file will shortly be indexed by the search engine and the
                       site configuration currently would not prevent this.
The following are the actions to be taken in each state (see
FIG. 2):

Deleted (3a)           The resource no longer exists on the web site. The
                       process attempts to remove the resource entry from the
                       search engine database with a CGI program provided by
                       the engine for this purpose (4a).
Awaiting               No action is taken.
indexing (3b)
Out of date (3c)       The resource has been modified since it was last indexed
                       by the search engine. The process attempts to register
                       the resource for re-indexing with a CGI program provided
                       by the engine for this purpose.
Well registered (3d)   No action is taken.
Wrongly                The process attempts to remove the resource entry from
registered (3e)        the search engine index using a CGI program provided
                       by the search engine for this purpose.
Wrongly                The process attempts to add the resource to the search
unregistered (3f)      engine index using a CGI program provided by the
                       search engine for this purpose.
Correctly              No action is taken.
unregistered (3g)
Will be indexed        The web site manager is warned through the process
in error (3h)          reporting mechanism (e-mail, a web page, or other
                       method) that the manager does not want the resource to
                       be indexed, but the search engine will shortly index it
                       and there are no safeguards in place to prevent this.
                       The site manager can take appropriate steps to avoid
                       registration (4b) or registration will take place (4c).
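
The pairing of states and actions listed above can be expressed as a simple lookup table. The following Python sketch is one possible encoding; the enumeration names and the four action labels are assumptions for illustration and do not appear in the figures.

```python
from enum import Enum


class State(Enum):
    DELETED = "deleted"
    AWAITING_INDEXING = "awaiting indexing"
    OUT_OF_DATE = "out of date"
    WELL_REGISTERED = "well registered"
    WRONGLY_REGISTERED = "wrongly registered"
    WRONGLY_UNREGISTERED = "wrongly unregistered"
    CORRECTLY_UNREGISTERED = "correctly unregistered"
    WILL_BE_INDEXED_IN_ERROR = "will be indexed in error"


class Action(Enum):
    NONE = "no action"
    REGISTER = "register with the search engine"
    UNREGISTER = "remove from the search engine"
    WARN = "warn the web site manager"


# One action per state, following the list of actions above.
ACTIONS = {
    State.DELETED: Action.UNREGISTER,
    State.AWAITING_INDEXING: Action.NONE,
    State.OUT_OF_DATE: Action.REGISTER,
    State.WELL_REGISTERED: Action.NONE,
    State.WRONGLY_REGISTERED: Action.UNREGISTER,
    State.WRONGLY_UNREGISTERED: Action.REGISTER,
    State.CORRECTLY_UNREGISTERED: Action.NONE,
    State.WILL_BE_INDEXED_IN_ERROR: Action.WARN,
}


def action_for(state: State) -> Action:
    """Look up the action to take for a resource in the given state."""
    return ACTIONS[state]
```

Keeping the mapping in a table rather than in nested conditionals makes it easy to check that all eight states are covered.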
The following pseudo code indicates the necessary steps in
programming which must be taken to determine the state of a
resource and take the appropriate action.

For each enabled search engine in DatabaseLookup(table of search engines)
    list of files = search engine.table of files
    If search engine.limit to site
        search engine files = SearchEngineLookup(all files
            reported by search engine for this site)
        list of files = list of files + search engine files
    End If
    For each file in list of files
        last index date time = GetIndexDateTime(file, search engine)
        If FileExists(file, list of files)
            If search engine.table of files.file.toberegistered
                RegisterFile(file, search engine)
                Next For [each file in list of files]
            End If
            last modification date time = GetLastModificationDateTime(file)
            will be indexed = WillBeIndexed(file, search engine,
                last index date time)
            should be registered = ShouldBeRegistered(file, search engine)
            If last index date time found [File is registered]
                If should be registered
                    If last modification date time > last index date time
                        If will be indexed
                            AddReport("awaiting indexing", file)
                        Else
                            AddReport("out of date", file)
                            RegisterFile(file, search engine)
                        End If
                    Else
                        AddReport("well registered", file)
                    End If
                Else [File is registered but should not be]
                    AddReport("wrongly registered", file)
                    UnRegisterFile(file, search engine)
                End If
            Else [File is not registered]
                If should be registered
                    AddReport("wrongly unregistered", file)
                    RegisterFile(file, search engine)
                Else
                    If will be indexed
                        AddReport("will be indexed in error", file)
                    Else
                        AddReport("correctly unregistered", file)
                    End If
                End If
            End If
        Else [File does not exist]
            AddReport("deleted", file)
            If last index date time found
                UnRegisterFile(file, search engine)
            End If
        End If [File exists]
    End For
End For
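
The core of the decision made for each file can be condensed into a single function. The Python sketch below follows the branch labels of the pseudo code under the assumption that a file counts as registered when a last index date and time was found for it (represented here by a value other than `None`). The function name, its parameters, and the returned state strings are illustrative; the lookups against the search engine and the web server are taken as already performed.

```python
from datetime import datetime
from typing import Optional


def classify(exists: bool,
             last_indexed: Optional[datetime],
             last_modified: Optional[datetime],
             should_be_registered: bool,
             will_be_indexed: bool) -> str:
    """Return the registration state of one file on one search engine."""
    if not exists:
        return "deleted"
    if last_indexed is not None:          # file is registered
        if not should_be_registered:
            return "wrongly registered"
        if last_modified is not None and last_modified > last_indexed:
            # Changed since last indexed: either the engine will pick it
            # up shortly, or it has to be registered again.
            return "awaiting indexing" if will_be_indexed else "out of date"
        return "well registered"
    # File is not registered.
    if should_be_registered:
        return "wrongly unregistered"
    if will_be_indexed:
        return "will be indexed in error"
    return "correctly unregistered"
```

Each returned string corresponds to one of the eight states, so the result can be used to select the action to take.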
III. The Process by Which a Search Engine is Updated by a
Web Site Using This Process

There are three ways the process may update a search 

engine: 

1. It can register a resource in an attempt to have that file 
added to the search engine database (FIG. 3, Box 104). 

2. It can register a resource in an attempt to update the
resource's listing in the search engine database (FIG. 3,
Box 105).

3. It can unregister a resource in an attempt to remove the 
file from the search engine index (FIG. 3, Box 103). 

In practice, these three activities are usually performed by
the same CGI program on current search engines. This CGI
program is the 'register file' program and is run manually by
the user or automatically (FIG. 3, Box 100). An HTML form
is provided for the purpose of adding a resource to the search
engine index. On submitting the form, a CGI script is
invoked. The most common mode of action for this script is
as follows:

1. If the file exists (FIG. 3, Box 101), the search engine 
determines whether the configuration of the web site 
will allow indexing through robots.txt and/or ROBOTS 
Meta Tag (FIG. 3, Box 104). If the file does not exist 
and the file has been registered by the search engine 
(FIG. 3, Box 101, 102), it is removed immediately from 
the search engine database index (FIG. 3, Box 103). 

2. If the site can be indexed, the search engine determines 
if the resource is registered by the search engine. If the 
resource is registered, the search engine determines if 
the resource has changed since it was last indexed (FIG. 
3, Box 109). If the resource has changed since it was 
last indexed, the resource entry in the search engine 
database is updated with new data (FIG. 3, Box 109,
110). If the resource has not changed since it was last
indexed, then no action is taken (FIG. 3, Box 111). If
the site cannot be indexed, and the resource has been
indexed by the search engine (FIG. 3, Box 105), the 
entry for the resource is removed from the search 
engine database (FIG. 3, Box 106). 

3. In a case where the site can be indexed and the resource 
does not exist in the search engine database, the 
resource URL is added to a list of URLs the search
engine will index (FIG. 3, Box 108). Some search
engines will index resources submitted in this way
within a day or two of submission. Other search
engines may take weeks or months.
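
The three numbered steps above amount to a small decision table on the search engine side. The following Python sketch is one possible reading of that flow, not the search engine's actual CGI script. The names `Action` and `handle_submission` are hypothetical, and the "no action" result for a resource that is neither present nor indexed is an assumption, since the text does not cover that case.

```python
from enum import Enum


class Action(Enum):
    REMOVE = "remove entry from the index"
    UPDATE = "update entry with new data"
    QUEUE = "add URL to the list of URLs to index"
    NONE = "no action"


def handle_submission(exists: bool, indexable: bool,
                      in_index: bool, changed: bool) -> Action:
    """Decide what to do with one submitted URL (hypothetical helper)."""
    if not exists:
        # Step 1: a missing file that is in the index is removed at once.
        # Assumption: nothing happens if it was never indexed.
        return Action.REMOVE if in_index else Action.NONE
    if not indexable:
        # Step 2, last sentence: robots.txt or the ROBOTS meta tag
        # forbids indexing, so an existing entry is removed.
        return Action.REMOVE if in_index else Action.NONE
    if in_index:
        # Step 2: an indexed resource is refreshed only if it has changed.
        return Action.UPDATE if changed else Action.NONE
    # Step 3: indexable, but not yet known to the search engine.
    return Action.QUEUE
```

For example, `handle_submission(True, True, False, False)` yields `Action.QUEUE`, which corresponds to Box 108 of FIG. 3.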
The Following Pseudo Code Illustrates the Above Processes:



On RegisterFile(file, search engine)
    Check that the file is appropriate for the search engine
    If file is appropriate or IsRegistered(file, search engine)
        If file is not appropriate
            AddReport("inappropriate file registered", file)
        End If
        If !(file in DatabaseLookup(search engine, table of files))
            AddFileToDatabase(search engine, file)
        End If
        If SearchEngineRegistrationsOK(file, search engine)
            SearchEngineRegisterFile(file)
            If file registered OK
                search engine.table of files.file.date last registered = today's date
                search engine.table of files.file.time last registered = now
                AddReport("file registered", file)
                search engine.table of files.file.toberegistered = false
            Else
                AddReport("Registration failed", file)
                search engine.table of files.file.toberegistered = true
            End If
        Else
            AddReport("registration delayed", file)
            search engine.table of files.file.toberegistered = true
        End If
    Else
        AddReport("registration failed - inappropriate file", file)
    End If
End RegisterFile
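
As an illustration only, the RegisterFile routine above might be sketched in Python as follows. `FileRecord`, `EngineTable`, `register_file`, the `quota_ok` flag (standing in for SearchEngineRegistrationsOK) and the `submit` callback (standing in for SearchEngineRegisterFile) are all hypothetical names. Treating a file as already registered when it has a recorded registration time is likewise an assumption, since IsRegistered is not defined in this section.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FileRecord:
    last_registered: Optional[datetime] = None  # date and time last registered
    to_be_registered: bool = False              # set when a retry is needed


@dataclass
class EngineTable:
    files: Dict[str, FileRecord] = field(default_factory=dict)
    report: List[Tuple[str, str]] = field(default_factory=list)


def register_file(table: EngineTable, url: str, appropriate: bool,
                  quota_ok: bool, submit: Callable[[str], bool]) -> None:
    """Register one URL and record the outcome in the report."""
    known = table.files.get(url)
    # Assumption: "registered" means a registration time has been recorded.
    already_registered = known is not None and known.last_registered is not None
    if not (appropriate or already_registered):
        table.report.append(("registration failed - inappropriate file", url))
        return
    if not appropriate:
        table.report.append(("inappropriate file registered", url))
    # Add the file to the table of files if it is not there yet.
    record = table.files.setdefault(url, FileRecord())
    if not quota_ok:
        # The engine's rules do not allow a submission right now.
        table.report.append(("registration delayed", url))
        record.to_be_registered = True
        return
    if submit(url):
        record.last_registered = datetime.now()
        record.to_be_registered = False
        table.report.append(("file registered", url))
    else:
        record.to_be_registered = True
        table.report.append(("Registration failed", url))
```

Passing the submission step in as a callback keeps the sketch independent of any particular search engine's registration form.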

On UnRegisterFile(file, search engine)
    SearchEngineUnRegisterFile(file)
    If file unregistered OK
        AddReport("file unregistered", file)
        search engine.table of files.file.tobeunregistered = false
    Else
        AddReport("Unregistration failed", file)
        search engine.table of files.file.tobeunregistered = true
    End If
End UnRegisterFile
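
A correspondingly small Python sketch of UnRegisterFile is given here under similar assumptions. The `remove` callback stands in for SearchEngineUnRegisterFile, and the `pending` dictionary stands in for the tobeunregistered column of the table of files. Both names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def unregister_file(pending: Dict[str, bool], report: List[Tuple[str, str]],
                    url: str, remove: Callable[[str], bool]) -> None:
    """Ask the search engine to drop one URL and record the outcome."""
    if remove(url):
        report.append(("file unregistered", url))
        pending[url] = False
    else:
        report.append(("Unregistration failed", url))
        # Leave the flag set so the failure remains visible in the table.
        pending[url] = True
```

The flag is presumably what allows a later run of the process to attempt the removal again, although this section does not spell that out.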
The present invention would:

1. Significantly improve the quality of a site's registration
on a range of search engines. Out-of-date registrations
and registrations pointing at deleted files would be
quickly cleaned up. Unregistered files that the site
owner wanted registered would be quickly registered,
and currently indexed files that the site owner wanted
removed from the index would quickly be removed.
Registration would always be within the rules of each
search engine to which the process was applied.

2. Provide a new method for search engines to gather and 
distribute information. The process works best when 
the search engine and site owner cooperate for mutual 
benefit. The search engine should offer the following 
features in order for the process to work most efficiently:

a. Provide confirmation that a particular file is in the 
index. 

b. Provide the date and time the file was indexed or
guarantee immediate indexing

c. Provide the current date and time according to the 
search engine index 



d. Provide a means to add a file to the index (ideally 
immediately) 

e. Provide a means of removing a file from the index 
(ideally immediately) 

f. Impose no practical limit on the number of files that
may be registered within a fixed period

g. Provide a means of restricting searches to a particular 
site through a hidden field in the search CGI, the 
state of which is maintained on each page delivered 
by the search engine. Once a site has a perfect 
ongoing registration on a powerful search engine, 
that search engine is perfect for searches within that 
site. 
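
Feature (g) can be illustrated with a short Python sketch that emits a search form carrying the site restriction in a hidden field. The field names `site` and `q`, the function name `site_search_form`, and the example URL are assumptions made for illustration. An actual search engine would define its own CGI parameters.

```python
from html import escape


def site_search_form(action_url: str, site: str) -> str:
    """Return an HTML search form restricted to one site (hypothetical)."""
    return (
        f'<form action="{escape(action_url, quote=True)}" method="get">\n'
        # The hidden field carries the restriction. To maintain the state,
        # the engine would echo it back on every results page it delivers.
        f'  <input type="hidden" name="site" value="{escape(site, quote=True)}">\n'
        '  <input type="text" name="q">\n'
        '  <input type="submit" value="Search">\n'
        '</form>'
    )


print(site_search_form("http://search.example/cgi-bin/search", "example.com"))
```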

The following functions further describe the above
processes.



On DatabaseLookup(table of search engines)
    return table of search engines
End DatabaseLookup(table of search engines)
On DatabaseLookup(search engine, table of files)
    return table of files(search engine)
End DatabaseLookup(search engine, table of files)
On AddFileToDatabase(search engine, file)
    table of files(search engine) += file
End AddFileToDatabase(search engine, file)
On SearchEngineLookup(all files reported by search engine for site)
    list of files = ()
    page number = 1
    site links = SearchEngineGetPage(search engine, site, page number)
    while number of site links > 0
        list of files += site links
        increment page number
        site links = SearchEngineGetPage(search engine, site, page number)
    end while
    return list of files
End SearchEngineLookup(all files reported by search engine for site)
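
The paging loop in SearchEngineLookup can be sketched in Python as follows. The `get_page` callback is a hypothetical stand-in for SearchEngineGetPage with the search engine and site already bound. It is assumed to return an empty list once the result pages are exhausted.

```python
from typing import Callable, List


def search_engine_lookup(get_page: Callable[[int], List[str]]) -> List[str]:
    """Collect every link the engine reports for a site, page by page."""
    files: List[str] = []
    page_number = 1
    site_links = get_page(page_number)
    while site_links:                 # stop at the first empty page
        files.extend(site_links)
        page_number += 1
        site_links = get_page(page_number)
    return files


# Usage with canned result pages instead of a live search engine.
pages = {1: ["/index.html", "/about.html"], 2: ["/news.html"]}
all_links = search_engine_lookup(lambda n: pages.get(n, []))
```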
On FileExists(file, list of files)
    If file is local
        Perform stat of file
        return stat.exists
    else
        Perform HTTP head request of file
        If head request indicates that file exists
            Return file exists
        else
            Return file not exists
        end if
    end if
End FileExists(file)
On GetLastModificationDate(file)
    If file is local
        Perform stat of file
        return stat.LastModificationDate
    else
        Perform HTTP head request of file
        return response.LastModifiedDate
    end if
End GetLastModificationDate(file)
On GetIndexDateTime(file, search engine)
    If search engine.lists index date
        If search engine supports file lookup
            If (!LookupFile(search engine, file))
                last index date time = not found
            Else
                last index date time = lookup.date
                If search engine.lists index time
                    last index date time += lookup.time
                End If
            End If
        Else
            last index date time = not found
            For each phrase in file
                While GetNextSearchEnginePage(search engine, phrase)
                    If search engine page lists file



